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Abstract 



Recently, Mahoney and Orecchia demonstrated that popular diffusion-based pro- 
cedures to compute a quick approximation to the first nontrivial eigenvector of 
a data graph Laplacian exactly solve certain regularized Semi-Definite Programs 
(SDPs). In this paper, we extend that result by providing a statistical interpre- 
tation of their approximation procedure. Our interpretation will be analogous to 
the manner in which £2 -regularized or i?i -regularized ^2-regression (often called 
Ridge regression and Lasso regression, respectively) can be interpreted in terms 
of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the 
regression problem. Our framework will imply that the solutions to the Mahoney- 
Orecchia regularized SDP can be interpreted as regularized estimates of the pseu- 
doinverse of the graph Laplacian. Conversely, it will imply that the solution to this 
regularized estimation problem can be computed very quickly by running, e.g., 
the fast diffusion-based PageRank procedure for computing an approximation to 
the first nontrivial eigenvector of the graph Laplacian. Empirical results are also 
provided to illustrate the manner in which approximate eigenvector computation 
implicitly performs statistical regularization, relative to running the corresponding 
exact algorithm. 

1 Introduction 

Approximation algorithms and heuristic approximations are commonly used to speed up the run- 
ning time of algorithms in machine learning and data analysis. In some cases, the outputs of these 
approximate procedures are "better" than the output of the more expensive exact algorithms, in 
the sense that they lead to more robust results or more useful results for the downstream practi- 
tioner Recently, Mahoney and Orecchia formalized these ideas in the context of computing the 
first nontrivial eigenvector of a graph Laplacian Recall that, given a graph G on n nodes or 
equivalently its n x n Laplacian matrix L, the top nontrivial eigenvector of the Laplacian exactly op- 
timizes the Rayleigh quotient, subject to the usual constraints. This optimization problem can equiv- 
alently be expressed as a vector optimization program with the objective function f{x) = Lx, 
where x is an n-dimensional vector, or as a Semi-Definite Program (SDP) with objective function 
F{X) — Tr(LX), where X is an rt x n symmetric positive semi-definite matrix. This first non- 
trivial vector is, of course, of widespread interest in applications due to its usefulness for graph 
partitioning, image segmentation, data clustering, semi-supervised learning, etc. |l2][3j|4]|5j|6]|2l- 

In this context, Mahoney and Orecchia asked the question: do popular diffusion-based procedures — 
such as running the Heat Kernel or performing a Lazy Random Walk or computing the PageRank 
function — to compute a quick approximation to the first nontrivial eigenvector of L solve some 
other regularized version of the Rayleigh quotient objective function exactly! Understanding this 
algorithmic-statistical tradeoff is clearly of interest if one is interested in very large-scale applica- 
tions, where performing statistical analysis to derive an objective and then calling a black box solver 
to optimize that objective exactly might be too expensive. Mahoney and Orecchia answered the 
above question in the affirmative, with the interesting twist that the regularization is on the SDP 
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formulation rather than the usual vector optimization problem. That is, these three diffusion-based 
procedures exactly optimize a regularized SDP with objective function F{X) + ^G{X), for some 
regularization function G(-) to be described below, subject to the usual constraints. 

In this paper, we extend the Mahoney-Orecchia result by providing a statistical interpretation of 
their approximation procedure. Our interpretation will be analogous to the manner in which £2- 
regularized or ^i-regularized ^2-regression (often called Ridge regression and Lasso regression, 
respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on 
the coefficient vector of the regression problem. In more detail, we will set up a sampling model, 
whereby the graph Laplacian is interpreted as an observation from a random process; we will posit 
the existence of a "population Laplacian" driving the random process; and we will then define an 
estimation problem: find the inverse of the population Laplacian. We will show that the maximum a 
posteriori probability (MAP) estimate of the inverse of the population Laplacian leads to a regular- 
ized SDP, where the objective function F{X) — Tr{LX) and where the role of the penalty function 
G(-) is to encode prior assumptions about the population Laplacian. In addition, we will show that 
when G( ) is the log-determinant function then the MAP estimate leads to the Mahoney-Orecchia 
regularized SDP corresponding to running the PageRank heuristic. Said another way, the solutions 
to the Mahoney-Orecchia regularized SDP can be interpreted as regularized estimates of the pseu- 
doinverse of the graph Laplacian. Moreover, by Mahoney and Orecchia's main result, the solution 
to this regularized SDP can be computed very quickly — ^rather than solving the SDP with a black- 
box solver and rather computing explicitly the pseudoinverse of the Laplacian, one can simply run 
the fast diffusion-based PageRank heuristic for computing an approximation to the first nontrivial 
eigenvector of the Laplacian L. 

The next section describes some background. Section [3] then describes a statistical framework for 
graph estimation; and Section [4] describes prior assumptions that can be made on the population 
Laplacian. These two sections will shed light on the computational implications associated with 
these prior assumptions; but more importantly they will shed light on the implicit prior assumptions 
associated with making certain decisions to speed up computations. Then, Section [5] will provide 
an empirical evaluation, and Section |6] will provide a brief conclusion. Additional discussion is 
available in the Appendix. 

2 Background on Laplacians and diffusion-based procedures 

A weighted symmetric graph G is defined by a vertex set V — {1, ... , n}, an edge set E C V x V, 
and a weight function w : E ^ M+, where w is assumed to be symmetric (i.e., w{u, v) — w{v, u)). 
In this case, one can construct a matrix, Lq E M^^^, called the combinatorial Laplacian of G: 



where d{u) ~ w{u,v) is called the degree of u. By construction, Lq is positive semidefinite. 
Note that the all-ones vector, often denoted 1, is an eigenvector of Lq with eigenvalue zero, i.e., LI = 
0. For this reason, 1 is often called trivial eigenvector of Lq. Letting D he a diagonal matrix with 
D{u, u) — d{u), one can also define a normalized version of the Laplacian: L = D^^^^LqD^^^^. 
Unless explicitly stated otherwise, when we refer to the Laplacian of a graph, we will mean the 
normalized Laplacian. 

In many situations, e.g., to perform spectral graph partitioning, one is interested in computing the 
first nontrivial eigenvector of a Laplacian. Typically, this vector is computed "exactly" by calling 
a black-box solver; but it could also be approximated with an iteration-based method (such as the 
Power Method or Lanczos Method) or by running a random walk-based or diffusion-based method 
to the asymptotic state. These random walk-based or diffusion-based methods assign positive and 
negative "charge" to the nodes, and then they let the distribution of charge evolve according to 
dynamics derived from the graph structure. Three canonical evolution dynamics are the following: 

Heat Kernel. Here, the charge evolves according to the heat equation — —LHt. Thus, the 

vector of charges evolves as Ht = cxp (—tL) = X^fc^o ''^k\ where t > is a time 
parameter, times an input seed distribution vector. 

PageRank. Here, the charge at a node evolves by either moving to a neighbor of the current node 
or teleporting to a random node. More formally, the vector of charges evolves as 




—w{u, v) when u ^ v, 

d{u) — w{u, u) otherwise. 
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where M is the natural random walk transition matrix associated with the graph and where 
7 e (0, 1) is the so-called teleportation parameter, times an input seed vector 

Lazy Random Walk. Here, the charge either stays at the current node or moves to a neighbor 
Thus, if M is the natural random walk transition matrix associated with the graph, then the 
vector of charges evolves as some power of Wa = a/ + (1 — a)M, where a G (0, 1) 
represents the "holding probability," times an input seed vector. 

In each of these cases, there is a parameter (t, 7, and the number of steps of the Lazy Random 
Walk) that controls the "aggressiveness" of the dynamics and thus how quickly the diffusive process 
equilibrates; and there is an input "seed" distribution vector. Thus, e.g., if one is interested in global 
spectral graph partitioning, then this seed vector could be a vector with entries drawn from {—1, +1} 
uniformly at random, while if one is interested in local spectral graph partitioning |[8]|9l[l0l[lT], then 
this vector could be the indicator vector of a small "seed set" of nodes. See Appendix [A] for a brief 
discussion of local and global spectral partitioning in this context. 

Mahoney and Orecchia showed that these three dynamics arise as solutions to SDPs of the form 

mim^ize Tr(LX) + )^G{X) 

subject to X 

Tr(X) = 1, 

XD^/"^! = 0, 

where G is a penalty function (shown to be the generaUzed entropy, the log-determinant, and a 
certain matrix-p-norm, respectively 11]) and where 77 is a parameter related to the aggressiveness 
of the diffusive process [IJ. Conversely, solutions to the regularized SDP of (|2]i for appropriate 
values of 77 can be computed exactly by running one of the above three diffusion-based procedures. 
Notably, when = 0, the solution to the SDP of (j2|l is uu\ where u is the smallest nontrivial 
eigenvector of L. More generally and in this precise sense, the Heat Kernel, PageRank, and Lazy 
Random Walk dynamics can be seen as "regularized" versions of spectral clustering and Laplacian 
eigenvector computation. Intuitively, the function G(-) is acting as a penalty function, in a manner 
analogous to the £2 or £1 penalty in Ridge regression or Lasso regression, and by running one of 
these three dynamics one is implicitly making assumptions about the form of G( ). In this paper, we 
provide a statistical framework to make that intuition precise. 

3 A statistical framework for regularized graph estimation 

Here, we will lay out a simple Bayesian framework for estimating a graph Laplacian. Importantly, 
this framework will allow for regularization by incorporating prior information. 

3.1 Analogy with regularized linear regression 

It will be helpful to keep in mind the Bayesian interpretation of regularized linear regression. In 
that context, we observe n predictor-response pairs in x M, denoted (xi, yi), . . . , (xn, i/„); the 
goal is to find a vector /? such that j3'xi « Ui. Typically, we choose /3 by minimizing the residual 
sum of squares, i.e., F{f3) = RSS(/3) — \\yi — P'xi\W, or a penalized version of it. For Ridge 
regression, we minimize F{ji) + A||/3||2; while for Lasso regression, we minimize F{f3) + A||/3|| 1. 

The additional terms in the optimization criteria {i.e., A||/3||| and A||/3|ji) are called penalty func- 
tions; and adding a penalty function to the optimization criterion can often be interpreted as incor- 
porating prior information about /3. For example, we can model j/i, . . . , y„ as independent random 
observations with distributions dependent on j3. Specifically, we can suppose yi is a Gaussian ran- 
dom variable with mean jS'xi and known variance a^. This induces a conditional density for the 
vector y = . . .,y„): 

p(y |/3)(xexp{-2i^F(/3)}, (3) 

where the constant of proportionality depends only on y and a. Next, we can assume that (3 itself 
is random, drawn from a distribution with density p{f3). This distribution is called a prior, since it 
encodes prior knowledge about /?. Without loss of generality, the prior density can be assumed to 
take the form 

p(/3) cx exp{-t/(/3)}. (4) 
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Since the two random variables are dependent, upon observing y, we have information about /?. This 
information is encoded in the posterior density, p(/3 | y), computed via Bayes' rule as 

p(/3 I y) cx p{y \ fi)p{fi) cx eM-^F[P) - C/(/3)}. (5) 

The MAP estimate of /3 is the value that maximizes p{j3 \ y); equivalently, it is the value of f3 
that minimizes — logp(/3 | y). In this framework, we can recover the solution to Ridge regres- 
sion or Lasso regression by setting C/(/3) = 2ct2II^II2 ^(/^) ~ 2ct2|I/^IIi' '"^^p^'-'-i^^iy- Thus, 

Ridge regression can be interpreted as imposing a Gaussian prior on /3, and Lasso regression can be 
interpreted as imposing a double-exponential prior on (3. 

3.2 Bayesian inference for the population Laplacian 

For our problem, suppose that we have a connected graph with n nodes; or, equivalently, that we 
have L, the normalized Laplacian of that graph. We will view this observed graph Laplacian, L, 
as a "sample" Laplacian, i.e., as random object whose distribution depends on a true "population" 
Laplacian, C. As with the linear regression example, this induces a conditional density for L, to be 
denoted p{L \ C). Next, we can assume prior information about the population Laplacian in the 
form of a prior density, p{C); and, given the observed Laplacian, we can estimate the population 
Laplacian by maximizing its posterior density, p{C \ L). 

Thus, to apply the Bayesian formalism, we need to specify the conditional density of L given C. In 
the context of linear regression, we assumed that the observations followed a Gaussian distribution. 
A graph Laplacian is not just a single observation — it is a positive semidefinite matrix with a very 
specific structure. Thus, we will take L to be a random object with expectation C, where C is another 
normalized graph Laplacian. Although, in general, C can be distinct from L, we will require that the 
nodes in the population and sample graphs have the same degrees. That is, if d — ((i(l), . . . , d{n)) 
denotes the "degree vector" of the graph, and D — diag ((i(l), . . . , d{nj), then we can define 

A" = {X : X ^ 0, XD^/'^l ^ 0, rank(X) = n - 1}, (6) 

in which case the population Laplacian and the sample Laplacian will both be members of X. To 
model L, we will choose a distribution for positive semi-definite matrices analogous to the Gaussian 
distribution: a scaled Wishart matrix with expectation £. Note that, although it captures the trait 
that L is positive semi-definite, this distribution does not accurately model every feature of L. For 
example, a scaled Wishart matrix does not necessarily have ones along its diagonal. However, the 
mode of the density is at C, a Laplacian; and for large values of the scale parameter, most of the 
mass will be on matrices close to C. Appendix [Bjprovides a more detailed heuristic justification for 
the use of the Wishart distribution. 

To be more precise, let to > n — 1 be a scale parameter, and suppose that L is distributed over X as 
a ^Wishart(£, m) random variable. Then, E[L \ C] ^ C, and L has conditional density 

PiL I C) cx ^, (7) 

where | • | denotes pseudodeterminant (product of nonzero eigenvalues). The constant of proportion- 
ality depends only on L, d, to, and n; and we emphasize that the density is supported on X. Eqn. (|7]) 
is analogous to Eqn. ^ in the linear regression context, with 1/to, the inverse of the sample size 
parameter, playing the role of the variance parameter a^. Next, suppose we have know that £ is a 
random object drawn from a prior density p{C). Without loss of generality, 

p(£) cx exp{-;7(£)}, (8) 

for some function U, supported on a subset X C X. Eqn. (|8]l is analogous to Eqn. (|4]) from the 
linear regression example. Upon observing L, the posterior distribution for £ is 

p{£ I L) cx p{L I £)p(£) « exp{-f Tr(L£+) + f log |£+| - U{C)}, (9) 

with support determined by X. Eqn. (|9]l is analogous to Eqn. Q from the linear regression example. 
If we denote by £ the MAP estimate of £, then it follows that £+ is the solution to the program 

minimize Tt{LX) + ^U{X+) ~ log \X\ 

X _ " (10) 

subject to X e X C X. 
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Note the similarity with Mahoney-Orecchia regularized SDP of O). In particular, if X = {X : 
Tr(X) = 1} n A", then the two programs are identical except for the factor of log \X\ in the opti- 
mization criterion. 



4 A prior related to the PageRank procedure 

Here, we will present a prior distribution for the population Laplacian that will allow us to leverage 
the estimation framework of Section[3] and we will show that the MAP estimate of C for this prior is 
related to the PageRank procedure via the Mahoney-Orecchia regularized SDP. Appendix[C]presents 
priors that lead to the Heat Kernel and Lazy Random Walk in an analogous way; in both of these 
cases, however, the priors are data-dependent in the strong sense that they explicitly depend on the 
number of data points. 

4.1 Prior density 

The prior we will present will be based on neutrality and invariance conditions; and it will be sup- 
ported on X, i.e., on the subset of positive-semidefinite matrices that was the support set for the 
conditional density defined in Eqn. (j7]i. In particular, recall that, in addition to being positive semi- 
definite, every matrix in the support set has rank rt — 1 and satisfies XD^^^l — 0. Note that because 
the prior depends on the data (via the orthogonality constraint induced by D), this is not a prior in the 
fully Bayesian sense; instead, the prior can be considered as part of an empirical or pseudo-Bayes 
estimation procedure. 

The prior we will specify depends only on the eigenvalues of the normalized Laplacian, or equiva- 
lently on the eigenvalues of the pseudoinverse of the Laplacian. Let = t OAO' be the spectral 
decomposition of the pseudoinverse of the normalized Laplacian £, where t > is a scale factor, 
O G jj'ix"-! is an orthogonal matrix, and A = diag (A(l), . . . , X{n — 1)), where X{v) = 1. 
Note that the values A(l), . . . , X{n — 1) are unordered and that the vector A = (A(l), . . . , X{n — 1)) 
lies in the unit simplex. If we require that the distribution for A be exchangeable (invariant under 
permutations) and neutral (A(w) independent of the vector (A(u)/(1 — X{v)) : u ^ v), for all 
v), then the only non-degenerate possibility is that A is Dirichlet-distributed with parameter vector 
(a, . . . , a) 1 12 |. The parameter a, to which we refer as the "shape" parameter, must satisfy a > 
for the density to be defined. In this case, 

n-l 

p{c)^p{T)\{x{vr-\ (11) 

where p(r) is a prior for r. Thus, the prior weight on C only depends on r and A. One implication 
is that the prior is "nearly" rotationally invariant, in the sense that p{P' CP) = p{C) for any rank- 
(n — 1) projection matrix P satisfying PD^/^1 ~ 0. 

4.2 Posterior estimation and connection to PageRank 



To analyze the MAP estimate associated with the prior of Eqn. (Ill and to explain its connection 
with the PageRank dynamics, the following proposition is crucial. 

Proposition 4.1. Suppose the conditional likelihood for L given L is as defined in (|7]l and the prior 
density for C is as defined in Define L to be the MAP estimate of C. Then, \Tt:{L'^)]~^ L'^ 

solves the Mahoney-Orecchia regularized SDP (|2|, with G{X) ~ — log|X| and rj as given in 



Eqn. {12\ below. 



Proof. For C in the support set of the posterior, define r — Tr(£+) and = r^^£+, so that 
Tr(0) = 1. Further, rank(0) = n — 1. Express the prior in the form of Eqn. ([8]) with function U 
given by 

U{C) = - log{p(T) |er-i} - -{a - 1) log |e| - logp(r), 

where, as before, | • | denotes pseudodeterminant. Using (|9]l and the relation |£+| = t"^^|0|, the 
posterior density for C given L is 

p{C I i) cx exp { - ^ Tr(Le) + -^±^^ bg |e| + ^(t)}, 
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where ^(t) = ™'^"-~^'> logr + logp(r). Suppose C maximizes the posterior HkeHhood. Define 
f = Tr(£+) and Q — [f]~^£+. In this case, 8 must minimize the quantity Tr(LG)) — ^ log |8|, 
where 



m + 2[a — 1) 

Thus e solves the regularized SDP Q with G{X) = - log \X\. □ 

Mahoney and Orecchia showed that the solution to (|2]i with G{X) — — log \X\ is closely related to 
the PageRank matrix, R^, defined in Eqn. ([T]i. By combining Proposition 4.1 with their result, we 



get that the MAP estimate of C satisfies £+ cx D^^^'^ R^D^/^; conversely, R-y oc D^^^C+D'^/"^. 
Thus, the PageRank operator of Eqn. ([TJ can be viewed as a degree-scaled regularized estimate of 
the pseudoinverse of the Laplacian. Moreover, prior assumptions about the spectrum of the graph 
Laplacian have direct implications on the optimal teleportation parameter. Specifically Mahoney 
and Orecchia's Lemma 2 shows how rj is related to the teleportation parameter 7, and Eqn. ( [T2] ) 
shows how the optimal 77 is related to prior assumptions about the Laplacian. 

5 Empirical evaluation 

In this section, we provide an empirical evaluation of the performance of the regularized Laplacian 
estimator, compared with the unregularized estimator. To do this, we need a ground truth population 
Laplacian C and a noisily-observed sample Laplacian L. Thus, in Section [5T| we construct a family 
of distributions for C; importantly, this family will be able to represent both low-dimensional graphs 
and expander-like graphs. Interestingly, the prior of Eqn. ( [TT| captures some of the qualitative 



features of both of these types of graphs (as the shape parameter is varied). Then, in Section 5.2 
we describe a sampling procedure for L which, superficially, has no relation to the scaled Wishart 
conditional density of Eqn. Q. Despite this model mis specification, the regularized estimator L,, 
outperforms L for many choices of the regularization parameter 77. 

5.1 Ground truth generation and prior evaluation 

The ground truth graphs we generate are motivated by the Watts-Strogatz "small-world" model 1 13 |. 
To generate a ground truth population Laplacian, £ — equivalently, a population graph — we start 
with a two-dimensional lattice of width w and height h, and thus n — wh nodes. Points in the lattice 
are connected to their four nearest neighbors, making adjustments as necessary at the boundary. We 
then perform s edge-swaps: for each swap, we choose two edges uniformly at random and then 
we swap the endpoints. For example, if we sample edges ii ~ ji and 12 ~ j2, then we replace 
these edges with ii ~ j2 and 12 ~ ji- Thus, when s = 0, the graph is the original discretization 
of a low-dimensional space; and as s increases to infinity, the graph becomes more and more like 
a uniformly chosen 4-regular graph (which is an expander 1 14 1 and which bears similarities with 
an Erdos-Renyi random graph |15|). Indeed, each edge swap is a step of the Metropolis algorithm 
toward a uniformly chosen random graph with a fixed degree sequence. For the empirical evaluation 
presented here, h — 7 and w — 6; but the results are qualitatively similar for other values. 

Figure [T| compares the expected order statistics (sorted values) for the Dirichlet prior of Eqn. ( [TT| ) 
with t he exp ected eigenvalues of 8 = £+/Tr(£+) for the small-world model. In particular, in 
Figure [T(a)[ we show the behavior of the order statistics of a Dirichlet distribution on the (n — 1)- 
dimensional simplex with scalar shape parameter a, as a function of a. For each value of the 
shape a, we generated a random (n — 1) -dimensional Dirichlet vector. A, with parameter vector 
(a, . . . , a); we computed the n — 1 order statistics of A by sorting its components; and we repeated 



this procedure for 500 replicates and averaged the values. Figure 1(b) shows a corresponding plot 
for the ordered eigenvalues of 8. For each value of s (normalized, here, by the number of edges /i, 
where /i — 2wh ~ w — h — 71), we generated the normalized Laplacian, C, corresponding to the 
random s-edge-swapped grid; we computed the n — 1 nonzero eigenvalues of 8; and we performed 
1000 replicates of this procedure and averaged the resulting eigenvalues. 

Interestingly, the behavior of the spectrum of the small-world model as the edge-swaps increase is 
qualitatively quite similar to the behavior of the Dirichlet prior order statistics as the shape param- 
eter a increases. In particular, note that for small values of the shape parameter a the first few 
order-statistics are well- separated from the rest; and that as a increases, the order statistics become 
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0.5 1.0 1.5 2.0 0.0 0.2 0.4 0.6 0.8 1.0 

Shape ^ Swaps /Edges 

(a) Dirichlet distribution order statistics. (b) Spectrum of the inverse Laplacian. 



Figure 1: Analytical and empirical priors. 1(a) shows the Dirichlet distribution order statistics versus 
the shape parameter; and |l(b)| shows the spectrum of 8 as a function of the rewiring parameter. 

concentrated around 1/(71 — 1). Similarly, when the edge-swap parameter s = 0, the top two eigen- 
values (corresponding to the width-wise and height-wise coordinates on the grid) are well-separated 
from the bulk; as s increases, the top eigenvalues quickly merge into the bulk; and eventually, as s 
goes to infinity, the distribution becomes very close that that of a uniformly chosen 4-regular graph. 

5.2 Sampling procedure, estimation performance, and optimal regularization behavior 

Finally, we evaluate the estimation performance of a regularized estimator of the graph Laplacian 
and compare it with an unregularized estimate. To do so, we construct the population graph Q and its 



Laplacian C, for a given value of s, as described in Section 5.1 Let /i be the number of edges in Q. 
The sampling procedure used to generate the observed graph G and its Laplacian L is parameterized 
by the sample size m. (Note that this parameter is analogous to the Wishart scale parameter in 
Eqn. (|7|, but here we are sampling from a different distribution.) We randomly choose m edges with 
replacement from Q; and we define sample graph G and coiTesponding Laplacian L by setting the 
weight of i ~ j equal to the number of times we sampled that edge. Note that the sample graph G 
over-counts some edges in Q and misses others. 

We then compute the regularized estimate up to a constant of proportionality, by solving (implic- 
itly!) the Mahoney-Orecchia regularized SDP Q with G{X) = — log \X\. We define the unregular- 
ized estimate L to be equal to the observed Laplacian, L. Given a population Laplacian C, we define 
T = t{£) = Tr(£+) and G = 0(£) — t^^C^. We define f,,, f, 9r,, and Q similarly to the popu- 
lation quantities. Our performance criterion is the relative Frobenius eiTor ||9 — 8,,||f/||0 — 6||f, 
where || • ||f denotes the Frobenius norm (PIIf = [Tr(A'A)]i/2). Appendix |d] presents similar 
results when the performance criterion is the relative spectral norm error. 



Figures 2(a) 2(b) and 2(c) show the regularization performance when s = 4 (an intermediate 
value) for three different values of m/ fi. In each case, the mean error and one standard deviation 
around it are plotted as a function of 77/f, as computed from 100 replicates; here, f is the mean 
value of r over all replicates. The implicit regularization clearly improves the performance of the 
estimator for a large range of 77 values. (Note that the regularization parameter in the regularized 
SDP (pl is 1/77, and thus smaller values along the X-axis coiTespond to stro nger r egularization.) In 



particular, when the data are very noisy, e.g., when m/^ = 0.2, as in Figure 2(a) improved results 



are seen only for very strong regularization; for intermediate levels of noise, e.g., 777/// = 1.0, as in 
2(b)| (in which case m is chosen such that G and Q have the same number of edges counting 



Figure 
multip 

of noise. Figure |2(c)| illustrates that improved results are obtained for moderate levels of implicit 



icity), improved performance is seen for a wide range of values of 77; and for low levels 



regularization. Figures [2(d)| and [2(e)] illustrate similar results for s = and s ~ 32. 





(d) m//i = 2.0 and s = 0. 



(e) m//i — 2.0 and s = 32. 



(f) Optimal 77* /r. 



Figure 2: Regularization performance. 2(a) through |2(e) plot the relative Frobenius norm error, 
versus the (normalized) regularization parameter rj/f. Shown are plots for various values of the 
(normalized) number of edges, m/a, and the edge-swap parameter, s. Recall that the regularization 
parameter in the regula rized SDP B is I/77, and thus smaller values along the X-axis correspond to 
stronger regularization. 2(f) plots the optimal regularization parameter 77* /f as a function of sample 
proportion for different fractions of edge swaps. 



As when regularization is implemented explicitly, in all these cases, we observe a "sweet spot" 



where there is an optimal value for the implicit regularization parameter Figure 2(f) illustrates 
how the optimal choice of rj depends on parameters defining the population Laplacians and sample 
Laplacians. In particular, it illustrates how rj*, the optimal value of rj (normalized by f), depends 
on the sampling proportion m//i and the swaps per edges s/fi. Observe that as the sample size m 
increases, if converges monotonically to f; and, further, that higher values of s (corresponding to 
more expander-like graphs) correspond to higher values of rf. Both of these observations are in 
direct agreement with Eqn. ( [T2| i. 

6 Conclusion 

We have provided a statistical interpretation for the observation that popular diffusion-based proce- 
dures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian 
exactly solve a certain regularized version of the problem. One might be tempted to view our re- 
sults as "unfortunate," in that it is not straightforward to interpret the priors presented in this paper 
Instead, our results should be viewed as making explicit the implicit prior assumptions associated 
with making certain decisions (that are already made in practice) to speed up computations. 



Several extensions suggest themselves. The most obvious might be to try to obtain Proposition 4.1 
with a more natural or empirically-plausible model than the Wishart distribution; to extend the em- 
pirical evaluation to much larger and more realistic data sets; to apply our methodology to other 
widely-used approximation procedures; and to characterize when implicitly regularizing an eigen- 
vector leads to better statistical behavior in downstream applications where that eigenvector is used. 
More generally, though, we expect that understanding the algorithmic-statistical tradeoffs that we 
have illustrated will become increasingly important in very large-scale data analysis applications. 
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A Relationship with local and global spectral graph partitioning 



In this section, we briefly describe the connections between our results and global and local versions 
of spectral partitioning, which were a starting point for this work. 

The idea of spectral clustering is to approximate the best partition of the vertices of a connected 
graph into two pieces by using the nontrivial eigenvectors of a Laplacian (either the combinatorial 
or the normalized Laplacian). This idea has had a long history iiSl [17 2 3 4|. The simplest 
version of spectral clustering involves computing the first nontrivial eigenvector (or another vector 
with Rayleigh quotient close to that of the first nontrivial eigenvector) of L, and "sweeping" over 
that vector For example, one can take that vector, call it x, and, for a given threshold K, define the 
two sets of the partition as 

C{x,K) ^ {i : x,> K}, and 
C{x,K) ^{i:x,< K}. 

Typically, this vector x is computed by calling a black-box solver, but it could also be approximated 
with an iteration-based method (such as the Power Method or Lanczos Method) or a random walk- 
based method (such as running a diffusive procedure or PageRank-based procedure to the asymptotic 
state). Far from being a "heuristic," this procedure provides a global partition that (via Cheeger's in- 
equality and for an appropriate choice of K) satisfies provable quality-of-approximation guarantees 
with respect to the combinatorial optimization problem of finding the best "conductance" partition 
in the entire graph lfT8ll2l[3l. 

Although spectral clustering reduces a graph to a single vector — the smallest nontrivial eigenvector 
of the graph's Laplacian — and then clusters the nodes using the information in that vector, it is pos- 
sible to obtain much more information about the graph by looking at more than one eigenvector of 
the Laplacian. In particular, the elements of the pseudoinverse of the combinatorial Laplacian, Lq, 
give local {i.e., node-specific) information about random walks on the graph. The reason is that the 
pseudoinverse Lq of the Laplacian is closely related to random walks on the graph. See, e.g. fT9\ for 
details. For example, it is known that the quantity Lq{u, u) + Lq{v, v) — Lq{u, v) — Lq{v, u) is pro- 
portional to the commute time, a symmetrized version of the length of time before a random walker 
started at node u reaches node v, whenever u and v are in the same connected component [,20J. Simi- 
larly, the elements of the pseudoinverse of the normalized Laplacian give degree-scaled measures of 
proximity between the nodes of a graph. It is likely that v) has a probabilistic interpretation 

in terms of random walks on the graph, along the lines of our methodology, but we are not aware 
of any such interpretation. From this perspective, given and a cutoff value, K, we can define a 
local partition around node u via Pk{u) ~ {v : L^{u, v) > K}. (Note that if v is in Pk{u), then 
u is in Pfc(u); in addition, if the graph is disconnected, then there exists a K such that u and v are 
in the same connected component iff u e Pk{u).) We call clustering procedures based on this idea 
local spectral partitioning. 

Although the naive way of performing this local spectral partitioning, i.e., to compute L+ explicitly, 
is prohibitive for anything but very small graphs, these ideas form the basis for very fast local spectral 
clustering methods that employ truncated diffusion-based procedures to compute localized vectors 
with which to partition. For example, this idea can be implemented by performing a diffusion- 
based procedure with an input seed distribution vector localized on a node u and then sweeping 
over the resulting vector. This idea was originally introduced in |8| as a diffusion-based operational 
procedure that was local in a very strong sense and that led to Cheeger-like bounds analogous to 
those obtained with the usual global spectral partitioning; and this was extended and improved 
by (HI Ho). In addition, an optimization perspective on this was provided by ifTTIl . Although [11] is 
local in a weaker sense, it does obtain local Cheeger-like guarantees from an explicit locally-biased 
optimization problem, and it provides an optimization ansatz that may be interpreted as a "local 
eigenvector." See lE] |9l [TOl [n] for details. Understanding the relationship between the "operational 
procedure versus optimization ansatz" perspectives was the origin of [1] and thus of this work. 

B Heuristic justification for the Wishart density 

In this section, we describe a sampling procedure for L which, in a very crude sense, leads approxi- 
mately to a conditional Wishart density forp(L | C). 

Let G be a graph with vertex set V = {1,2,..., n}, edge set E — V x V equipped with the 
equivalence relation {u,v) = {v,u). Let ut be an edge weight function, and let Co and £ be 



10 



the corresponding combinatorial and normalized Laplacians. Let A be a diagonal matrix with 
A{u,u) = so that £ = A~^/^£o^~^^^- Suppose the weights are scaled such that 

J2{u t))e£; ^) ^ suppose further that A{u, u) > 0. We refer to oj{u, v) as the population 

weight of edge {u, v). 

A simple model for the sample graph is as follows: we sample m edges from E, randomly chosen 
according to the population weight function. That is, we see edges {ui,vi), {u2,V2), ■ ■ ■ , (wm> ^m)> 
where the edges are all drawn independently and identically such that the probability of seeing edge 
(tt, v) is determined by u): 

Vuj{{ui,vi) = {u,v)} = u){u,v). 

Note that we will likely see duplicate edges and not every edge with a positive weight will get 
sampled. Then, we construct a weight function from the sampled edges, called the sample weight 
function, w, defined such that 

^ m 

W{U, u) = — ^ l{{ui, Vi) = {u, v)}, 

i=l 

where 1{ } is an indicator vector. In turn, we construct a sample combinatorial Laplacian, Lq, 
defined such that 

Lo(u,v) = P^""^""'""^ whenu = t;, 
' \—w{u,v) otherwise. 

Let D be a diagonal matrix such that D{u, u) = w{u, v), and define L = D~^/'^LqD~^^'^. Let- 
ting E^j denote expectation with respect to the probabihty law P^^, note that ¥.t^[w{u, v)] = uj{u, v), 
that Et^Lo = and that E^D = A. Moreover, the strong law of large numbers guarantees that as 
m increases, these three quantities converge almost surely to their expectations. Further, Slutzky's 
theorem guarantees that ^/rn{L — C) and Y^A~^/^(io — £o)^~^^^ converge in distribution to 
the same limit. We use this large-sample behavior to approximate the the distribution of L by the 
distribution of A^^/^XqA^^/^. Put simply, we treat the degrees as known. 

The distribution of Lq is completely determined by the edge sampling scheme laid out above. How- 
ever, the exact form for the density involves an intractable combinatorial sum. Thus, we appeal to a 
crude approximation for the conditional density. The approximation works as follows: 

1. For i = 1, . . . , m, define Xi e M" such that 

{+,Sj when u = Ui, 
—Si when u = Vi, 
otherwise, 

where Sj G {—1, +1} is chosen arbitrarily. Note that Lo = ^ Y^T=i 

2. Take Sj to be random, equal to or — 1 with probability |. Approximate the distribution 
of Xi by the distribution of a multivariate normal random variable, Xi, such that Xi and Xi 
have the same first and second moments. 

3. Approximate the distribution of Lq by the distribution of Lq, where Lq = ^ S"=i ^i^'i- 

4. Use the asymptotic expansion above to approximate the distribution of L by the distribution 

of A-V2loA-i/2. 

The next two lemmas derive the distribution of Xi and Lq in terms of £, allowing us to get an 
approximation for p(L | £). 

Lemma B.l. With Xi and defined as above, 

Ea;N=E„[^i] = 0, 

and 

^ui[xix'i] = Et^[iiX-] = Cq. 

Proof. The random variable Xi is defined to have the same first and second moments as Xi. The first 
moment vanishes since Sj = — s, impUes that a;, = —Xi. For the second moments, note that when 

E„[a;i(u) Xi{v)\ = -s- Pc^{(ui,i;i) = {u,v)} = -w{u,v) = jCq{u,v). 
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Likewise, 

Lemma B.2. The random matrix Lq is distributed as ^Wishart(>Coj ™) random variable. This 
distribution is supported on the set of positive-semidefinite matrices with the same nullspace as Co- 
When m > rank(£o). the distribution has a density on this space given by 

rff ir ^ |Zo|("'— "'<(^)-i)/2exp{-f Tr(Lo/:+)} 
/(Lo I £0, m) cx ^^-p^ (13) 

where the constant of proportionality depends only on m and n and where \ ■ \ denotes pseudodeter- 
minant (product of nonzero eigenvalues). 

Proof. Since mL is a sum of m outer products of multivariate Normal(0, Cq), it is Wishart dis- 
tributed (by definition). Suppose rank(£o) — f and U G M"^'' is a matrix whose columns are the 

eigenvectors of £o- Note that U'xi = Normal(0, U'CqU), and that U'CqU has full rank. Thus, 
U'LqU has a density over the space of r x r positive-semidefinite matrices whenever m > r. The 
density of U'LU is exactly equal to /(£o | Cq, to), defined above. □ 



Using the previous lemma, the random variable L = A ^^'^LqA ^/-^ has density 
f{L \£,m) (X 



^l/2_£^l/2|(m-rank(£)-l)/2 ejcp{~fTT{A^/^LA^/^C+)} 



|AV2/:oAl/2|m/2 

where we have used that rank(£o) = rank(£), and the constant of proportionality depends on to, 
n, rank(/:), and A. Then, if we approximate \A^/^LA^/^\ « |A||i| and A^/^^+A^/^ « £+, then 
/ is "approximately" the density of a ^Wishart(£, m) random variable. These last approximations 
are necessary because L and Cq are rank-degenerate. 

To conclude, we do not want to overstate the validity of this heuristic justification. In particular, it 
makes three key approximations: 

1 . the true degree matrix A can be approximated by the observed degree matrix D; 

2. the distribution of Xi, a sparse vector, is well approximated Xi, a Gaussian (dense) vector; 

3. the quantities \ A^^'^LA^/'^\ and A^/2£+ai/2 ^.^^ replaced with |A||L| and C+. 

None of these approximations hold in general, though as argued above, the first is plausible if m is 
large relative to n. Likewise, since L and C are nearly full rank, the third approximation is likely 
not too bad. The biggest leap of faith is the second approximation. Note, e.g., that despite their first 
moments being equal, the second moments of Xix'^ and Xix'^ differ. 

C Other priors and the relationship to Heat Kernel and Lazy Random Walk 



There is a straightforward generalization of Proposition 4.1 to other priors. In this section, we state 
it, and we observe connections with the Heat Kernel and Lazy Random Walk procedures. 

Proposition C.l. Suppose the conditional likelihood for L given L is as defined in (|7]l and the prior 
density for L is of the form 

p{L) cx p(T)|er™/2 exp{-g(T) G(e)}, (14) 

where t — Tr(£"'"), Q — t^^C^ , and p and q are functions with q{T) > over the support of the 
prior Define C to be the MAP estimate of C. Then, \Tt:{ C'^)\ ~^ L'^ solves the Mahoney-Orecchia 
regularized SDP (|2]i, with G the same as in the expression AMI for p{C) and with 



where f = Tr(>C"'"). 



12 



The proof of this proposition is a straightforward generalization of th e proof of Proposition 4. 1 and 



is thus omitted. Note that we recover the result of Proposition 4.1 by setting G(6) — — log |0| 
and q{T) = ^ + a — 1. In addition, by choosing G(-) to be the generalized entropy or the matrix 
p-norm penalty of [l ], we obtain variants of the Mahon ey-O recchia regularized SDP Q with the 



regularization term G( ). By then combining Proposition C.l with their result, we get that the MAP 
estimate of £ is related to the Heat Kernel and Lazy Random Walk procedures, respectively, in a 
manner analogous to what we saw in Section|4]with the PageRank procedure. In both of these other 
cases, however, the prior is data-dependent in the strong sense that it explicitly depends on the 
number of data points; and, in addition, the priors for these other cases do not correspond to any 
well-recognizable parametric distribution. 

D Regularization performance with respect to the relative spectral error 

In this section, we present Figure |3] which shows the regularization performance for our empirical 
evaluation, when the performance criterion is the relative spectral norm error, i.e., ||9 — 0^||2/||6 — 
OII2, where || • II2 denotes spectral norm of a matrix (which is the largest singular value of that 
matrix). See Section [J!2] for details of the setup. Note that these results are very similar to those for 
the relative Frobenius norm error that are presented in Figure |2] 





0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 

Regularization Regularization Reguiarization 



(d) m//i = 2.0ands = 0. (e) m//i = 2.0 and s = 4. (f) m/^i = 2.0 and s = 32. 

Figure 3: Regularization performance, as measured with the relative spectral norm error, versus the 
(normalized) regularization parameter rj/f. Shown are plots for various values of the (normalized) 
number of edges, m/fi, and the edge-swap parameter, s. Recall that the regularization parameter 
in the regularized SDP (j2|i is 1/r], and thus smaller values along the X-axis correspond to stronger 
regularization. 
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